AITopics | multimodal environment

multimodal environment

CrafText Benchmark: Advancing Instruction Following in Complex Multimodal Open-Ended World

Volovikova, Zoya, Gorbov, Gregory, Kuderov, Petr, Panov, Aleksandr I., Skrynnik, Alexey

arXiv.org Artificial Intelligence, May-20-2025

Following instructions in real-world conditions requires the ability to adapt to the world's volatility and entanglement: the environment is dynamic and unpredictable, instructions can be linguistically complex with diverse vocabulary, and the number of possible goals an agent may encounter is vast. Despite extensive research in this area, most studies are conducted in static environments with simple instructions and a limited vocabulary, making it difficult to assess agent performance in more diverse and challenging settings. To address this gap, we introduce CrafText, a benchmark for evaluating instruction following in a multimodal environment with diverse instructions and dynamic interactions. CrafText includes 3,924 instructions with 3,423 unique words, covering Localization, Conditional, Building, and Achievement tasks. Additionally, we propose an evaluation protocol that measures an agent's ability to generalize to novel instruction formulations and dynamically evolving task configurations, providing a rigorous test of both linguistic understanding and adaptive decision-making.

large language model, machine learning, natural language, (19 more...)

2505.11962

Country: Europe > Russia (0.14)

Genre:

Research Report (0.64)
Workflow (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Vision (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
(2 more...)

RAP: Retrieval-Augmented Planning with Contextual Memory for Multimodal LLM Agents

Kagaya, Tomoyuki, Yuan, Thong Jing, Lou, Yuxuan, Karlekar, Jayashree, Pranata, Sugiri, Kinose, Akira, Oguri, Koki, Wick, Felix, You, Yang

arXiv.org Artificial Intelligence, Feb-5-2024

Owing to recent advancements, Large Language Models (LLMs) can now be deployed as agents for increasingly complex decision-making applications in areas including robotics, gaming, and API integration. However, reflecting past experiences in current decision-making processes, an innate human behavior, continues to pose significant challenges. To address this, we propose the Retrieval-Augmented Planning (RAP) framework, designed to dynamically leverage past experiences corresponding to the current situation and context, thereby enhancing agents' planning capabilities. RAP distinguishes itself by being versatile: it excels in both text-only and multimodal environments, making it suitable for a wide range of tasks. Empirical evaluations demonstrate RAP's effectiveness: it achieves SOTA performance in textual scenarios and notably enhances multimodal LLM agents' performance on embodied tasks. These results highlight RAP's potential in advancing the functionality and applicability of LLM agents in complex, real-world applications.

agent, contextual memory, retrieval-augmented planning, (16 more...)

2402.0361

Country:

Europe > Germany (0.04)
Oceania > Australia (0.04)
North America > United States (0.04)
(2 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.98)
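
The retrieval step described in the RAP abstract above (recalling past experiences that match the current situation) can be illustrated with a minimal sketch. This is not the authors' implementation: the `ExperienceMemory` class, the bag-of-words `embed` function, and the example situations are all hypothetical, and a real system would presumably use learned text or image embeddings rather than token counts.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding: lowercase token counts (an assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ExperienceMemory:
    """Hypothetical store of (situation, action) pairs from earlier episodes."""

    def __init__(self):
        self.records = []

    def add(self, situation, action):
        self.records.append((situation, action))

    def retrieve(self, situation, k=1):
        """Return the k stored experiences most similar to the given situation."""
        query = embed(situation)
        ranked = sorted(
            self.records,
            key=lambda record: cosine(query, embed(record[0])),
            reverse=True,
        )
        return ranked[:k]


memory = ExperienceMemory()
memory.add("mug is dirty and sink is nearby", "wash mug in sink")
memory.add("door is locked and key is on table", "pick up key")
memory.add("apple is on counter and fridge is closed", "open fridge")

# The retrieved experience would be passed to the planner as extra context.
best = memory.retrieve("the door is locked", k=1)
print(best[0][1])  # pick up key
```

In this sketch the retrieved action is only printed; in a planning agent it would typically be inserted into the LLM prompt alongside the current observation.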

Syntax-Guided Transformers: Elevating Compositional Generalization and Grounding in Multimodal Environments

Kamali, Danial, Kordjamshidi, Parisa

arXiv.org Artificial Intelligence, Nov-7-2023

Compositional generalization, the ability of intelligent models to extrapolate understanding of components to novel compositions, is a fundamental yet challenging facet in AI research, especially within multimodal environments. In this work, we address this challenge by exploiting the syntactic structure of language to boost compositional generalization. This paper elevates the importance of syntactic grounding, particularly through attention masking techniques derived from text input parsing. We introduce and evaluate the merits of using syntactic information in the multimodal grounding problem. Our results on grounded compositional generalization underscore the positive impact of dependency parsing across diverse tasks when utilized with Weight Sharing across the Transformer encoder. The results push the state-of-the-art in multimodal grounding and parameter-efficient modeling and provide insights for future research.

elevating compositional generalization and grounding, multimodal environment, syntax-guided transformer

2311.04364

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.93)
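
The Syntax-Guided Transformers abstract above mentions attention masks derived from parsing the text input. A minimal sketch of one way such a mask could be built from a dependency parse follows. The masking rule used here (each token may attend to itself, its head, and its direct dependents) is an assumption for illustration, since the abstract does not specify the exact scheme, and the head indices are hand-annotated rather than produced by a parser.

```python
def dependency_attention_mask(heads):
    """Build a boolean attention mask from dependency heads.

    heads[i] is the index of token i's syntactic head, or -1 for the root.
    mask[i][j] is True when token i is allowed to attend to token j.
    Assumed rule: self, head, and direct dependents only.
    """
    n = len(heads)
    mask = [[i == j for j in range(n)] for i in range(n)]
    for child, head in enumerate(heads):
        if head >= 0:
            mask[child][head] = True  # child attends to its head
            mask[head][child] = True  # head attends to its dependent
    return mask


# Hand-annotated example: "walk" is the root, "circle" depends on "walk",
# and "to", "the", "red" all depend on "circle".
tokens = ["walk", "to", "the", "red", "circle"]
heads = [-1, 4, 4, 4, 0]

mask = dependency_attention_mask(heads)
for token, row in zip(tokens, mask):
    print(f"{token:>6}", ["x" if allowed else "." for allowed in row])
```

A mask like this would be applied to the attention scores of a Transformer encoder so that disallowed positions receive no attention weight.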

Facebook's Captum brings explainability to machine learning

#artificialintelligence, Oct-11-2019, 17:36:41 GMT

Facebook today introduced Captum, a library for explaining decisions made by neural networks built with the deep learning framework PyTorch. Captum is designed to implement state-of-the-art versions of attribution algorithms such as Integrated Gradients, DeepLIFT, and Conductance. Captum allows researchers and developers to interpret decisions made in multimodal environments that combine, for example, text, images, and video, and to compare results against existing models within the library. Developers can also use Captum to understand feature importance or perform a deep dive on neural networks to understand neuron and layer attributions. The tool will also launch with Captum Insights, a visualization tool for Captum results.

captum bring explainability, facebook, multimodal environment, (7 more...)

Country: North America > United States > California > San Francisco County > San Francisco (0.06)

Industry: Information Technology > Services (0.68)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
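
Integrated Gradients, one of the attribution methods named in the Captum item above, can be illustrated without any library. The sketch below does not use Captum's API: it is a from-scratch approximation for a hypothetical three-feature logistic model whose weights, bias, and inputs are made up for the example. Each feature's attribution is its distance from a baseline times the average gradient along the straight path from baseline to input.

```python
import math

# Hypothetical model parameters, chosen only for illustration.
WEIGHTS = [2.0, -1.0, 0.5]
BIAS = 0.1


def model(x):
    """Logistic model: sigmoid(w . x + b)."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def gradient(x):
    """Analytic gradient of the model output with respect to each input."""
    y = model(x)
    return [y * (1.0 - y) * w for w in WEIGHTS]


def integrated_gradients(x, baseline, steps=200):
    """Approximate the path integral of gradients with a midpoint Riemann sum."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point)):
            total[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


x = [1.0, 2.0, -1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)

print("attributions:", attributions)
# Completeness check: attributions should sum to model(x) - model(baseline).
print("sum:", sum(attributions), "output difference:", model(x) - model(baseline))
```

The final check reflects the completeness property of Integrated Gradients: up to numerical error, the attributions add up to the change in model output between the baseline and the input.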